Abstract

This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.

1. Introduction

Empirical Loss Minimization. In this paper we consider the regularized empirical risk minimization (ERM) problem:

$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(A_i^\top w) + \lambda g(w). \tag{1}$$

In the context of supervised learning, $w$ is a linear predictor, $A_1, \dots, A_n \in \mathbb{R}^d$ are samples, $\phi_1, \dots, \phi_n : \mathbb{R} \to \mathbb{R}$ are loss functions, $g : \mathbb{R}^d \to \mathbb{R}$ is a regularizer and $\lambda > 0$ a regularization parameter. Hence, we are seeking to identify the predictor which minimizes the regularized average (empirical) loss $P(w)$.

We assume throughout that the loss functions are $1/\gamma$-smooth for some $\gamma > 0$. That is, we assume they are differentiable and have Lipschitz derivative with Lipschitz constant $1/\gamma$:

$$|\phi_i'(a) - \phi_i'(b)| \le \frac{1}{\gamma} |a - b|$$

for all $a, b \in \mathbb{R}$. Moreover, we assume that $g$ is 1-strongly convex with respect to the L2 norm:

$$g(w) \le \alpha g(w_1) + (1 - \alpha) g(w_2) - \frac{\alpha (1 - \alpha)}{2} \|w_1 - w_2\|^2$$

for all $w_1, w_2 \in \operatorname{dom} g$, $0 \le \alpha \le 1$ and $w = \alpha w_1 + (1 - \alpha) w_2$.

The ERM problem (1) has received considerable attention in recent years due to its widespread usage in supervised statistical learning (Shalev-Shwartz & Zhang, 2013b). Often, the number of samples $n$ is very large and it is important to design algorithms that would be efficient in this regime.

Modern stochastic algorithms for ERM. Several highly efficient methods for solving the ERM problem were proposed and analyzed recently. These include primal methods such as SAG (Schmidt et al., 2013), SVRG (Johnson & Zhang, 2013), S2GD (Konecny & Richtarik, 2014), SAGA (Defazio et al., 2014), mS2GD (Konecny et al., 2014a) and MISO (Mairal, 2014). Importance sampling was considered in ProxSVRG (Xiao & Zhang, 2014) and S2CD (Konecny et al., 2014b).

Stochastic Dual Coordinate Ascent. One of the most successful methods in this category is stochastic dual coordinate ascent (SDCA), which operates on the dual of the ERM problem (1):

$$\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n} \; D(\alpha) := -f(\alpha) - \psi(\alpha), \tag{2}$$

where the functions $f$ and $\psi$ are defined by

$$f(\alpha) := \lambda g^*\Big( \frac{1}{\lambda n} \sum_{i=1}^{n} A_i \alpha_i \Big), \tag{3}$$

$$\psi(\alpha) := \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-\alpha_i), \tag{4}$$

and $g^*$ and $\phi_i^*$ are the convex conjugates¹ of $g$ and $\phi_i$, respectively. Note that in the dual problem there are as many variables as there are samples in the primal: $\alpha \in \mathbb{R}^n$.

¹ By the convex (Fenchel) conjugate of a function $h : \mathbb{R}^k \to \mathbb{R}$ we mean the function $h^* : \mathbb{R}^k \to \mathbb{R}$ defined by $h^*(u) = \sup_s \{ s^\top u - h(s) \}$.

SDCA in each iteration randomly selects a dual variable and performs its update, usually via a closed-form formula; this strategy is known as randomized coordinate descent. Methods based on updating randomly
selected dual variables enjoy, in our setting, a linear convergence rate (Shalev-Shwartz & Zhang, 2013b; 2012; Takac et al., 2013; Shalev-Shwartz & Zhang, 2013a; Zhao & Zhang, 2014; Qu et al., 2014). These methods have attracted considerable attention in the past few years, and include SCD (Shalev-Shwartz & Tewari, 2011), RCDM (Nesterov, 2012), UCDC (Richtarik & Takac, 2014), ICD (Tappenden et al., 2013), PCDM (Richtarik & Takac, 2012), SPCDM (Fercoq & Richtarik, 2013), SPDC (Zhang & Xiao, 2014), APCG (Lin et al., 2014), RCD (Necoara & Patrascu, 2014), APPROX (Fercoq & Richtarik, 2013), QUARTZ (Qu et al., 2014) and ALPHA (Qu & Richtarik, 2014). Recent advances on mini-batch and distributed variants can be found in (Liu & Wright, 2014), (Zhao et al., 2014b), (Richtarik & Takac, 2013a), (Fercoq et al., 2014), (Trofimov & Genkin, 2014), (Jaggi et al., 2014), (Marecek et al., 2014) and (Mahajan et al., 2014). Other related work includes (Nemirovski et al., 2009; Duchi et al., 2011; Agarwal & Bottou, 2014; Zhao et al., 2014a; Fountoulakis & Tappenden, 2014; Tappenden et al., 2014). We also point to (Wright, 2014) for a review on coordinate descent algorithms.

Selection Probabilities. Naturally, both the theoretical convergence rate and the practical performance of randomized coordinate descent methods depend on the probability distribution governing the choice of individual coordinates. While most existing work assumes a uniform distribution, it was shown by Richtarik & Takac (2014); Necoara et al. (2012); Zhao & Zhang (2014) that coordinate descent works for an arbitrary fixed probability distribution over individual coordinates and even subsets of coordinates (Richtarik & Takac, 2013b; Qu et al., 2014; Qu & Richtarik, 2014; Qu & Richtarik, 2014). In all of these works the theory allows the computation of a fixed probability distribution, known as importance sampling, which optimizes the complexity bounds. However, such a distribution often depends on unknown quantities, such as the distances of the individual variables from their optimal values (Richtarik & Takac, 2014; Qu & Richtarik, 2014). In some cases, such as for smooth strongly convex functions or in the primal-dual setup we consider here, the probabilities forming an importance sampling can be explicitly computed (Richtarik & Takac, 2013b; Zhao & Zhang, 2014; Qu et al., 2014; Qu & Richtarik, 2014; Qu & Richtarik, 2014). Typically, the theoretical effect of using importance sampling is that the maximum of certain data-dependent quantities in the complexity bound is replaced by their average.

Adaptivity. Despite the striking developments in the field, there is virtually no literature on methods using an adaptive choice of the probabilities. We are aware of a few pieces of work, but all resort to heuristics unsupported by theory (Glasmachers & Dogan, 2013; Lukasewitz, 2013; Schaul et al., 2013; Banks-Watson, 2012; Loshchilov et al., 2011), which unfortunately also means that the methods are sometimes effective, and sometimes not. We observe that in the primal-dual framework we consider, each dual variable can be equipped with a natural measure of progress which we call "dual residue". We propose that the selection probabilities be constructed based on these quantities.

Outline: In Section 2 we summarize the contributions of our work. In Section 3 we describe our first, theoretical method (Algorithm 1) and the intuition behind it. In Section 4 we provide a convergence analysis. In Section 5 we introduce Algorithm 2: a variant of Algorithm 1 containing heuristic elements which make it efficiently implementable. We conclude with numerical experiments in Section 6. Technical proofs and additional numerical experiments can be found in the appendix.

2. Contributions 

We now briefly highlight the main contributions of this work.

Two algorithms with adaptive probabilities. We propose two new stochastic dual ascent algorithms: AdaSDCA (Algorithm 1) and AdaSDCA+ (Algorithm 2) for solving (1) and its dual problem (2). The novelty of our algorithms is in the adaptive choice of the probability distribution over the dual coordinates.

Complexity analysis. We provide a convergence rate analysis for the first method, showing that AdaSDCA enjoys a better rate than the best known rate for SDCA with a fixed sampling (Zhao & Zhang, 2014; Qu et al., 2014). The probabilities are proportional to a certain measure of dual suboptimality associated with each variable.

Practical method. AdaSDCA requires the same computational effort per iteration as the batch gradient algorithm. To solve this issue, we propose AdaSDCA+ (Algorithm 2): an efficient heuristic variant of AdaSDCA. The computational effort of the heuristic method in a single iteration is low, which makes it very competitive with methods based on importance sampling, such as IProx-SDCA (Zhao & Zhang, 2014). We support this with computational experiments in Section 6.


3. The Algorithm: AdaSDCA

It is well known that the optimal primal-dual pair $(w^*, \alpha^*) \in \mathbb{R}^d \times \mathbb{R}^n$ satisfies the following optimality conditions:

$$w^* = \nabla g^*\Big( \frac{1}{\lambda n} A \alpha^* \Big), \tag{5}$$

$$\alpha_i^* = -\nabla \phi_i(A_i^\top w^*), \quad \forall i \in [n] := \{1, \dots, n\}, \tag{6}$$

where $A$ is the $d$-by-$n$ matrix with columns $A_1, \dots, A_n$.

Definition 1 (Dual residue). The dual residue, $\kappa = (\kappa_1, \dots, \kappa_n) \in \mathbb{R}^n$, associated with $(w, \alpha)$ is given by:

$$\kappa_i := \alpha_i + \nabla \phi_i(A_i^\top w). \tag{7}$$

Note that $\kappa_i^t = 0$ if and only if $\alpha_i^t$ satisfies (6) with $w^*$ replaced by $w^t$. This motivates the design of AdaSDCA (Algorithm 1) as follows: whenever $|\kappa_i^t|$ is large, the $i$th dual coordinate $\alpha_i$ is suboptimal and hence should be updated more often.

Definition 2 (Coherence). We say that a probability vector $p^t \in \mathbb{R}^n$ is coherent with the dual residue $\kappa^t$ if for all $i \in [n]$ we have

$$\kappa_i^t \ne 0 \;\Rightarrow\; p_i^t > 0.$$

Alternatively, $p^t$ is coherent with $\kappa^t$ if for

$$I_t := \{ i \in [n] : \kappa_i^t \ne 0 \} \subseteq [n]$$

we have $\min_{i \in I_t} p_i^t > 0$.

Algorithm 1 AdaSDCA

Init: $v_i = A_i^\top A_i$ for $i \in [n]$; $\alpha^0 \in \mathbb{R}^n$; $\bar{\alpha}^0 = \frac{1}{\lambda n} A \alpha^0$
for $t \ge 0$ do
    Primal update: $w^t = \nabla g^*(\bar{\alpha}^t)$
    Set: $\alpha^{t+1} = \alpha^t$
    Compute residue $\kappa^t$: $\kappa_i^t = \alpha_i^t + \nabla \phi_i(A_i^\top w^t)$, $\forall i \in [n]$
    Compute probability distribution $p^t$ coherent with $\kappa^t$
    Generate random $i_t \in [n]$ according to $p^t$
    Compute: $\Delta \alpha_{i_t}^t = \arg\max_{\Delta \in \mathbb{R}} \big\{ -\phi_{i_t}^*(-(\alpha_{i_t}^t + \Delta)) - A_{i_t}^\top w^t \Delta - \frac{v_{i_t}}{2 \lambda n} |\Delta|^2 \big\}$
    Dual update: $\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta \alpha_{i_t}^t$
    Average update: $\bar{\alpha}^{t+1} = \bar{\alpha}^t + \frac{\Delta \alpha_{i_t}^t}{\lambda n} A_{i_t}$
end for
Output: $w^t, \alpha^t$

AdaSDCA is a stochastic dual coordinate ascent method with an adaptive probability vector $p^t$, which could potentially change at every iteration $t$. The primal and dual update rules are exactly the same as in standard SDCA (Shalev-Shwartz & Zhang, 2013b), which instead uses the uniform sampling probability at every iteration and does not require the computation of the dual residue $\kappa$.

Our first result highlights a key technical tool which ultimately leads to the development of good adaptive sampling distributions $p^t$ in AdaSDCA. For simplicity we denote by $E_t$ the expectation with respect to the random index $i_t \in [n]$ generated at iteration $t$.

Lemma 3. Consider the AdaSDCA algorithm during iteration $t \ge 0$ and assume that $p^t$ is coherent with $\kappa^t$. Then

$$E_t\big[ D(\alpha^{t+1}) - D(\alpha^t) \big] - \theta \big( P(w^t) - D(\alpha^t) \big) \ge -\frac{\theta}{2 \lambda n^2} \sum_{i \in I_t} \Big( \frac{\theta (v_i + n \lambda \gamma)}{p_i^t} - n \lambda \gamma \Big) |\kappa_i^t|^2 \tag{8}$$

for arbitrary

$$0 \le \theta \le \min_{i \in I_t} p_i^t. \tag{9}$$

Proof. Lemma 3 is proved similarly to Lemma 2 in (Zhao & Zhang, 2014), but in a slightly more general setting. For completeness, we provide the proof in the appendix. □
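To make the steps of Algorithm 1 concrete, the following is a minimal sketch for one special case that is our assumption here, not the general method: the quadratic loss $\phi_i(x) = (x - y_i)^2 / (2\gamma)$ with $g(w) = \frac{1}{2}\|w\|^2$, for which $\nabla g^*(u) = u$ and the argmax in the dual update has the closed form $\Delta = -\kappa_i \lambda n \gamma / (v_i + \lambda n \gamma)$. The sampling rule used is proportional to $|\kappa_i| \sqrt{v_i + n\lambda\gamma}$, which is one coherent choice. The function names `adasdca_quadratic` and `duality_gap` are illustrative only.

```python
import numpy as np

def adasdca_quadratic(A, y, lam, gamma=1.0, iters=1000, seed=0):
    """Sketch of AdaSDCA for quadratic loss and g(w) = 0.5 * ||w||^2 (assumed setting)."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    v = np.sum(A * A, axis=0)              # v_i = A_i^T A_i
    alpha = np.zeros(n)
    alpha_bar = A @ alpha / (lam * n)
    for _ in range(iters):
        w = alpha_bar                      # primal update: grad g*(u) = u
        kappa = alpha + (A.T @ w - y) / gamma          # dual residue, O(nnz(A)) work
        weights = np.abs(kappa) * np.sqrt(v + n * lam * gamma)
        total = weights.sum()
        if total == 0.0:                   # all residues vanish: optimal pair reached
            break
        i = rng.choice(n, p=weights / total)           # p coherent with kappa
        delta = -kappa[i] * lam * n * gamma / (v[i] + lam * n * gamma)
        alpha[i] += delta                  # dual update
        alpha_bar = alpha_bar + delta / (lam * n) * A[:, i]   # average update
    return alpha_bar, alpha                # (w, alpha), since w = alpha_bar here

def duality_gap(A, y, lam, gamma, w, alpha):
    """P(w) - D(alpha) for the same quadratic-loss setting."""
    n = A.shape[1]
    primal = np.mean((A.T @ w - y) ** 2) / (2.0 * gamma) + 0.5 * lam * (w @ w)
    alpha_bar = A @ alpha / (lam * n)
    dual = -0.5 * lam * (alpha_bar @ alpha_bar) - np.mean(-alpha * y + 0.5 * gamma * alpha ** 2)
    return primal - dual
```

Recomputing the full residue vector in every iteration is what makes this variant expensive; the sketch keeps it to stay close to the algorithm as stated.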


Lemma 3 plays a key role in the analysis of stochastic dual coordinate methods (Shalev-Shwartz & Zhang, 2013b; Zhao & Zhang, 2014; Shalev-Shwartz & Zhang, 2013a). Indeed, if the right-hand side of (8) is nonnegative, then the primal-dual error $P(w^t) - D(\alpha^t)$ can be bounded by the expected dual ascent $E_t[D(\alpha^{t+1}) - D(\alpha^t)]$ times $1/\theta$, which yields the contraction of the dual error at the rate of $1 - \theta$ (see Theorem 7). In order to make the right-hand side of (8) nonnegative we can take any $\theta$ no larger than $\theta(\kappa^t, p^t)$, where the function $\theta(\cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is defined by:

$$\theta(\kappa, p) := \frac{n \lambda \gamma \sum_{i : \kappa_i \ne 0} |\kappa_i|^2}{\sum_{i : \kappa_i \ne 0} p_i^{-1} |\kappa_i|^2 (v_i + n \lambda \gamma)}. \tag{10}$$

We also need to make sure that $0 \le \theta \le \min_{i \in I_t} p_i^t$ in order to apply Lemma 3. A "good" adaptive probability $p^t$ should then be the solution of the following optimization problem:

$$\max_{p \in \mathbb{R}^n_+} \; \theta(\kappa^t, p) \quad \text{s.t.} \quad \sum_{i=1}^{n} p_i = 1, \quad \theta(\kappa^t, p) \le \min_{i : \kappa_i^t \ne 0} p_i. \tag{11}$$
A feasible solution to (11) is the importance sampling (also known as optimal serial sampling) $p^*$ defined by:

$$p_i^* := \frac{v_i + n \lambda \gamma}{\sum_{j=1}^{n} (v_j + n \lambda \gamma)}, \quad \forall i \in [n], \tag{12}$$

which was proposed in (Zhao & Zhang, 2014) to obtain the proximal stochastic dual coordinate ascent method with importance sampling (IProx-SDCA). The same optimal probability vector was also deduced, via different means and in a more general setting, in (Qu et al., 2014). Note that in this special case, since $p^t$ is independent of the residue $\kappa^t$, the computation of $\kappa^t$ is unnecessary and hence the complexity of each iteration does not scale up with $n$.

It seems difficult to identify other feasible solutions to program (11) apart from $p^*$, not to mention solve it exactly. However, by relaxing the constraint $\theta(\kappa^t, p) \le \min_{i : \kappa_i^t \ne 0} p_i$, we obtain an explicit optimal solution.

Lemma 4. The optimal solution $p^*(\kappa)$ of

$$\max_{p \in \mathbb{R}^n_+} \; \theta(\kappa, p) \quad \text{s.t.} \quad \sum_{i=1}^{n} p_i = 1 \tag{13}$$

is:

$$(p^*(\kappa))_i = \frac{|\kappa_i| \sqrt{v_i + n \lambda \gamma}}{\sum_{j=1}^{n} |\kappa_j| \sqrt{v_j + n \lambda \gamma}}, \quad \forall i \in [n]. \tag{14}$$

Proof. The proof is deferred to the appendix. □

The suggestion made by (14) is clear: we should update more often those dual coordinates $\alpha_i$ which have a large absolute dual residue $|\kappa_i^t|$ and/or a large Lipschitz constant $v_i$.

If we let $p^t = p^*(\kappa^t)$ and $\theta = \theta(\kappa^t, p^t)$, the constraint (9) may not be satisfied, in which case (8) does not necessarily hold. However, as shown by the next lemma, the constraint (9) is not required for obtaining (8) when all the functions $\{\phi_i\}_i$ are quadratic.

Lemma 5. Suppose that all $\{\phi_i\}_i$ are quadratic. Let $t \ge 0$. If $\min_{i \in I_t} p_i^t > 0$, then (8) holds for any $\theta \in [0, +\infty)$.

The proof is deferred to the appendix.
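The quantities in (10), (12) and (14) are simple to evaluate numerically. The sketch below (function names are ours, chosen for illustration) computes $\theta(\kappa, p)$ and the two sampling distributions, which lets one check on any instance that the importance sampling (12) yields the residue-independent value $n\lambda\gamma / \sum_j (v_j + n\lambda\gamma)$, while the adaptive choice (14) yields a value at least as large. It assumes $p_i > 0$ wherever $\kappa_i \ne 0$, i.e. coherence.

```python
import numpy as np

def theta(kappa, p, v, n, lam, gamma):
    """Evaluate theta(kappa, p) from (10); requires p_i > 0 wherever kappa_i != 0."""
    mask = kappa != 0
    num = n * lam * gamma * np.sum(kappa[mask] ** 2)
    den = np.sum(kappa[mask] ** 2 * (v[mask] + n * lam * gamma) / p[mask])
    return num / den

def importance_sampling(v, n, lam, gamma):
    """Fixed distribution (12): p_i proportional to v_i + n*lam*gamma."""
    q = v + n * lam * gamma
    return q / q.sum()

def adaptive_probabilities(kappa, v, n, lam, gamma):
    """Adaptive distribution (14): p_i proportional to |kappa_i| * sqrt(v_i + n*lam*gamma)."""
    weights = np.abs(kappa) * np.sqrt(v + n * lam * gamma)
    return weights / weights.sum()
```

For example, with `p_imp = importance_sampling(v, n, lam, gamma)` and `p_ada = adaptive_probabilities(kappa, v, n, lam, gamma)`, comparing `theta(kappa, p_ada, ...)` against `theta(kappa, p_imp, ...)` shows how much the residue-aware distribution gains on a given instance.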

4. Convergence results 

In this section we present our theoretical complexity results for AdaSDCA. The main results are formulated in Theorem 7, covering the general case, and in Theorem 11 in the special case when $\{\phi_i\}_{i=1}^n$ are all quadratic.

4.1. General loss functions

We derive the convergence result from Lemma 3.

Proposition 6. Let $t \ge 0$. If $\min_{i \in I_t} p_i^t > 0$ and $\theta(\kappa^t, p^t) \le \min_{i \in I_t} p_i^t$, then

$$E_t\big[ D(\alpha^{t+1}) - D(\alpha^t) \big] \ge \theta(\kappa^t, p^t) \big( P(w^t) - D(\alpha^t) \big).$$

Proof. This follows directly from Lemma 3 and the fact that the right-hand side of (8) equals 0 when $\theta = \theta(\kappa^t, p^t)$. □


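Theorem 7. Consider AdaSDCA. If at each iteration $t \ge 0$, $\min_{i \in I_t} p_i^t > 0$ and $\theta(\kappa^t, p^t) \le \min_{i \in I_t} p_i^t$, then

$$E\big[ P(w^t) - D(\alpha^t) \big] \le \frac{1}{\tilde{\theta}_t} \prod_{k=0}^{t-1} (1 - \tilde{\theta}_k) \big( D(\alpha^*) - D(\alpha^0) \big) \tag{15}$$

for all $t \ge 0$, where

$$\tilde{\theta}_t := \frac{E\big[ \theta(\kappa^t, p^t) \big( P(w^t) - D(\alpha^t) \big) \big]}{E\big[ P(w^t) - D(\alpha^t) \big]}. \tag{16}$$

Proof. By Proposition 6, we know that

$$E\big[ D(\alpha^{t+1}) - D(\alpha^t) \big] \ge E\big[ \theta(\kappa^t, p^t) \big( P(w^t) - D(\alpha^t) \big) \big] \overset{(16)}{=} \tilde{\theta}_t \, E\big[ P(w^t) - D(\alpha^t) \big] \tag{17}$$

$$\ge \tilde{\theta}_t \, E\big[ D(\alpha^*) - D(\alpha^t) \big],$$

whence

$$E\big[ D(\alpha^*) - D(\alpha^{t+1}) \big] \le (1 - \tilde{\theta}_t) \, E\big[ D(\alpha^*) - D(\alpha^t) \big].$$

Therefore,

$$E\big[ D(\alpha^*) - D(\alpha^t) \big] \le \prod_{k=0}^{t-1} (1 - \tilde{\theta}_k) \big( D(\alpha^*) - D(\alpha^0) \big).$$

By plugging the last bound into (17) we get the bound on the primal-dual error:

$$E\big[ P(w^t) - D(\alpha^t) \big] \le \frac{1}{\tilde{\theta}_t} E\big[ D(\alpha^{t+1}) - D(\alpha^t) \big] \le \frac{1}{\tilde{\theta}_t} E\big[ D(\alpha^*) - D(\alpha^t) \big] \le \frac{1}{\tilde{\theta}_t} \prod_{k=0}^{t-1} (1 - \tilde{\theta}_k) \big( D(\alpha^*) - D(\alpha^0) \big).$$

□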

As mentioned in Section 3, by letting every sampling probability $p^t$ be the importance sampling (optimal serial sampling) $p^*$ defined in (12), AdaSDCA reduces to IProx-SDCA proposed in (Zhao & Zhang, 2014). The convergence theory established for IProx-SDCA in (Zhao & Zhang, 2014), which can also be derived as a direct corollary of our Theorem 7, is stated as follows.

Theorem 8 ((Zhao & Zhang, 2014)). Consider AdaSDCA with $p^t = p^*$ defined in (12) for all $t \ge 0$. Then

$$E\big[ P(w^t) - D(\alpha^t) \big] \le \frac{1}{\theta^*} (1 - \theta^*)^t \big( D(\alpha^*) - D(\alpha^0) \big),$$

where

$$\theta^* := \frac{n \lambda \gamma}{\sum_{i=1}^{n} (v_i + \lambda \gamma n)}.$$

The next corollary suggests that a better convergence rate than that of IProx-SDCA can be achieved by using a properly chosen adaptive sampling probability.

Corollary 9. Consider AdaSDCA. If at each iteration $t \ge 0$, $p^t$ is the optimal solution of (11), then (15) holds and $\tilde{\theta}_t \ge \theta^*$ for all $t \ge 0$.

However, solving (11) requires large computational effort, because of the dimension $n$ and the non-convex structure of the program. We show in the next section that when all the loss functions $\{\phi_i\}_i$ are quadratic, we can get a better convergence rate in theory than IProx-SDCA by using the optimal solution of (13).

4.2. Quadratic loss functions 

The main difficulty of solving (11) comes from the inequality constraint, which originates from (9). In this section we show that the constraint (9) can be dropped if all $\{\phi_i\}_i$ are quadratic.

Proposition 10. Suppose that all $\{\phi_i\}_i$ are quadratic. Let $t \ge 0$. If $\min_{i \in I_t} p_i^t > 0$, then

$$E_t\big[ D(\alpha^{t+1}) - D(\alpha^t) \big] \ge \theta(\kappa^t, p^t) \big( P(w^t) - D(\alpha^t) \big).$$

Proof. This is a direct consequence of Lemma 5 and the fact that the right-hand side of (8) equals 0 when $\theta = \theta(\kappa^t, p^t)$. □

Theorem 11. Suppose that all $\{\phi_i\}_i$ are quadratic. Consider AdaSDCA. If at each iteration $t \ge 0$, $\min_{i \in I_t} p_i^t > 0$, then (15) holds for all $t \ge 0$.

Proof. We only need to apply Proposition 10. The rest of the proof is the same as in Theorem 7. □

Corollary 12. Suppose that all $\{\phi_i\}_i$ are quadratic. Consider AdaSDCA. If at each iteration $t \ge 0$, $p^t$ is the optimal solution of (13), which has the closed form (14), then (15) holds and $\tilde{\theta}_t \ge \theta^*$ for all $t \ge 0$.
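The inequality in Corollary 12 can be seen from a short calculation, sketched here as a reading aid rather than as the appendix proof. Write $q_i := v_i + n\lambda\gamma$ (our shorthand) and substitute the closed form (14) into (10); the Cauchy-Schwarz inequality then gives the comparison with $\theta^*$ pointwise in $\kappa$, and hence also for the weighted average in (16).

```latex
% Shorthand (assumption of this note): q_i := v_i + n\lambda\gamma, \quad S := \sum_j |\kappa_j| \sqrt{q_j}.
% With p_i = |\kappa_i| \sqrt{q_i} / S as in (14), each term of the denominator of (10) is
%   p_i^{-1} |\kappa_i|^2 q_i = S \, |\kappa_i| \sqrt{q_i},
% so the denominator sums to S^2.
\theta\big(\kappa, p^*(\kappa)\big)
  = \frac{n\lambda\gamma \sum_i |\kappa_i|^2}{\big( \sum_i |\kappa_i| \sqrt{q_i} \big)^2}
  \;\ge\; \frac{n\lambda\gamma \sum_i |\kappa_i|^2}{\big( \sum_i |\kappa_i|^2 \big) \big( \sum_i q_i \big)}
  = \frac{n\lambda\gamma}{\sum_i (v_i + n\lambda\gamma)}
  = \theta^* .
```

Equality holds when $|\kappa_i|$ is proportional to $\sqrt{q_i}$; the more unevenly the residues are spread relative to that profile, the larger the gain over the fixed importance sampling.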

5. Efficient heuristic variant 

Corollaries 9 and 12 suggest how to choose the adaptive sampling probability in AdaSDCA so as to obtain a theoretical convergence rate at least as good as that of IProx-SDCA (Zhao & Zhang, 2014). However, there are two main implementation issues of AdaSDCA:

1. The update of the dual residue $\kappa^t$ at each iteration costs $O(\mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ is the number of nonzero elements of the matrix $A$;

2. We do not know how to compute the optimal solution of (11).

In this section, we propose a heuristic variant of AdaSDCA, which avoids the above two issues while staying close to the "good" adaptive sampling distribution.

5.1. Description of Algorithm 


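Algorithm 2 AdaSDCA+

Parameter: a number $m > 1$
Initialization: Choose $\alpha^0 \in \mathbb{R}^n$, set $\bar{\alpha}^0 = \frac{1}{\lambda n} A \alpha^0$
for $t \ge 0$ do
    Primal update: $w^t = \nabla g^*(\bar{\alpha}^t)$
    Set: $\alpha^{t+1} = \alpha^t$
    if $\mathrm{mod}(t, n) == 0$ then
        Option I: Adaptive probability
            Compute: $\kappa_i^t = \alpha_i^t + \nabla \phi_i(A_i^\top w^t)$, $\forall i \in [n]$
            Set: $p_i^t \propto |\kappa_i^t| \sqrt{v_i + n \lambda \gamma}$, $\forall i \in [n]$
        Option II: Optimal importance probability
            Set: $p_i^t \propto (v_i + n \lambda \gamma)$, $\forall i \in [n]$
    end if
    Generate random $i_t \in [n]$ according to $p^t$
    Compute: $\Delta \alpha_{i_t}^t = \arg\max_{\Delta \in \mathbb{R}} \big\{ -\phi_{i_t}^*(-(\alpha_{i_t}^t + \Delta)) - A_{i_t}^\top w^t \Delta - \frac{v_{i_t}}{2 \lambda n} |\Delta|^2 \big\}$
    Dual update: $\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta \alpha_{i_t}^t$
    Average update: $\bar{\alpha}^{t+1} = \bar{\alpha}^t + \frac{\Delta \alpha_{i_t}^t}{\lambda n} A_{i_t}$
    Probability update: $p_{i_t}^{t+1} \propto p_{i_t}^t / m$, $\; p_j^{t+1} \propto p_j^t$, $\forall j \ne i_t$
end for
Output: $w^t, \alpha^t$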


AdaSDCA+ has the same structure as AdaSDCA with a few important differences.

Epochs. AdaSDCA+ is divided into epochs of length $n$. At the beginning of every epoch, sampling probabilities are computed according to one of two options. During each epoch the probabilities are cheaply updated at the end of every iteration to approximate the adaptive model. The intuition is as follows. After $i$ is sampled and the dual coordinate $\alpha_i$ is updated, the residue $\kappa_i$ naturally decreases. We then also decrease the probability that $i$ is chosen in the next iteration, by setting $p^{t+1}$ to be proportional to $(p_1^t, \dots, p_{i-1}^t, p_i^t / m, p_{i+1}^t, \dots, p_n^t)$. By doing this we avoid the computation of $\kappa$ at each iteration (issue 1), which costs as much as the full gradient algorithm, while following closely the changes of the dual residue $\kappa$. We reset the adaptive sampling probability after every epoch of length $n$.

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Parameter m. The setting of the parameter $m$ in AdaSDCA+ directly affects the performance of the algorithm. If $m$ is too large, the probability of sampling the same coordinate twice during an epoch will be very small. This will result in a random permutation through all coordinates every epoch. On the other hand, for $m$ too small the coordinates having larger probabilities at the beginning of an epoch could be sampled more often than they should, even after their corresponding dual residues become sufficiently small. We do not have a definitive rule for the choice of $m$ and we leave this to future work. Experiments with different choices of $m$ can be found in Section 6.

Option I & Option II. At the beginning of each epoch, one can choose between two options for resetting the sampling probability. Option I corresponds to the optimal solution of (13), given by the closed form (14). Option II is the optimal serial sampling probability (12), the same as the one used in IProx-SDCA (Zhao & Zhang, 2014). However, AdaSDCA+ differs significantly from IProx-SDCA since we also iteratively update the sampling probability, which, as we show through numerical experiments, yields faster convergence than IProx-SDCA.
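The epoch structure and the two reset options can be sketched as follows. As an assumption made only for this illustration, the code specializes to the quadratic loss $\phi_i(x) = (x - y_i)^2/(2\gamma)$ with $g(w) = \frac{1}{2}\|w\|^2$, so that the dual step has the closed form $\Delta = -\kappa_i \lambda n \gamma / (v_i + \lambda n \gamma)$. It also samples with a plain $O(n)$ draw instead of a tree-based sampler, and the name `adasdca_plus_quadratic` is hypothetical.

```python
import numpy as np

def adasdca_plus_quadratic(A, y, lam, gamma=1.0, m=10.0, epochs=50, option=1, seed=0):
    """Sketch of AdaSDCA+ for quadratic loss and g(w) = 0.5 * ||w||^2 (assumed setting)."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    q = np.sum(A * A, axis=0) + n * lam * gamma    # v_i + n*lam*gamma
    alpha = np.zeros(n)
    alpha_bar = np.zeros(d)                        # (1/(lam*n)) * A @ alpha for alpha = 0
    weights = q.copy()                             # unnormalized sampling probabilities
    for t in range(epochs * n):
        w = alpha_bar                              # primal update: grad g*(u) = u
        if t % n == 0:                             # start of an epoch: reset probabilities
            if option == 1:                        # Option I: residue-based, as in (14)
                kappa = alpha + (A.T @ w - y) / gamma
                weights = np.abs(kappa) * np.sqrt(q)
            else:                                  # Option II: importance sampling (12)
                weights = q.copy()
            if weights.sum() == 0.0:               # all residues are zero: nothing to do
                break
        i = rng.choice(n, p=weights / weights.sum())
        kappa_i = alpha[i] + (A[:, i] @ w - y[i]) / gamma   # only coordinate i is needed
        delta = -kappa_i * lam * n * gamma / q[i]
        alpha[i] += delta                          # dual update
        alpha_bar = alpha_bar + delta / (lam * n) * A[:, i] # average update
        weights[i] /= m                            # probability update for the sampled index
    return alpha_bar, alpha
```

Inside an epoch only the sampled coordinate's residue is evaluated, which is what removes the per-iteration $O(\mathrm{nnz}(A))$ cost of the theoretical method.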

5.2. Computational cost 

Sampling and probability update. During the algorithm we sample $i \in [n]$ from a non-uniform probability distribution $p^t$, which changes at each iteration. This process can be done efficiently using the Random Counters algorithm introduced in Section 6.2 of (Nesterov, 2012), which takes $O(n \log(n))$ operations to create the probability tree and $O(\log(n))$ operations to sample from the distribution or change one of the probabilities.

Total computational cost. We can compute the computational cost of one epoch. At the beginning of an epoch, we need $O(\mathrm{nnz})$ operations to calculate the dual residue $\kappa$. Then we create a probability tree using $O(n \log(n))$ operations. At each iteration we need $O(\log(n))$ operations to sample a coordinate, $O(\mathrm{nnz}/n)$ operations to calculate the update to $\alpha$ and a further $O(\log(n))$ operations to update the probability tree. As a result an epoch needs $O(\mathrm{nnz} + n \log(n))$ operations. For comparison purposes we list in Table 1 the one-epoch computational cost of comparable algorithms.
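One common way to obtain logarithmic-time sampling and updates is a binary tree whose internal nodes store the sums of the weights below them. The sketch below is such a sum tree; it is our illustration of the idea and is not claimed to reproduce the Random Counters algorithm of (Nesterov, 2012) in detail. The class name `SumTree` and its methods are hypothetical.

```python
class SumTree:
    """Array-based binary tree over n leaf weights; node k stores the sum of its subtree."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (2 * self.n)          # leaves live at indices n .. 2n-1
        self.tree[self.n:] = [float(x) for x in weights]
        for k in range(self.n - 1, 0, -1):        # fill internal nodes bottom-up
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def weight(self, i):
        return self.tree[i + self.n]

    def total(self):
        return self.tree[1]

    def update(self, i, weight):
        """Set the weight of leaf i and refresh the sums on the path to the root."""
        k = i + self.n
        self.tree[k] = float(weight)
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def sample(self, u):
        """Map u in [0, 1) to a leaf index drawn with probability proportional to its weight."""
        target = u * self.tree[1]
        k = 1
        while k < self.n:                         # descend until a leaf is reached
            left = self.tree[2 * k]
            if target < left:
                k = 2 * k
            else:
                target -= left
                k = 2 * k + 1
        return k - self.n
```

In an AdaSDCA+ iteration one would draw `i = tree.sample(random.random())` and then call `tree.update(i, tree.weight(i) / m)`; the weights never need to be renormalized because sampling is relative to the stored total.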

6. Numerical Experiments 

In this section we present results of numerical experiments. 


Table 1. One epoch computational cost of different algorithms

Algorithm                  Cost of an epoch
SDCA & QUARTZ (uniform)    O(nnz)
IProx-SDCA                 O(nnz + n log(n))
AdaSDCA                    O(n · nnz)
AdaSDCA+                   O(nnz + n log(n))


Table 2. Dimensions and nonzeros of the datasets

Dataset      d          n          nnz/(nd)
w8a          300        49,749     3.9%
dorothea     100,000    800        0.9%
mushrooms    112        8,124      18.8%
cov1         54         581,012    22%
ijcnn1       22         49,990     41%

6.1. Loss functions 

We test AdaSDCA, AdaSDCA+, SDCA and IProx-SDCA for two different types of loss functions $\{\phi_i\}_{i=1}^n$: quadratic loss and smoothed hinge loss. Let $y \in \mathbb{R}^n$ be the vector of labels. The quadratic loss is given by

$$\phi_i(x) = \frac{1}{2\gamma} (x - y_i)^2$$

and the smoothed hinge loss is:

$$\phi_i(x) = \begin{cases} 0 & y_i x \ge 1, \\ 1 - y_i x - \gamma/2 & y_i x \le 1 - \gamma, \\ \dfrac{(1 - y_i x)^2}{2\gamma} & \text{otherwise.} \end{cases}$$

In both cases we use the L2-regularizer, i.e.,

$$g(w) = \frac{1}{2} \|w\|^2.$$

Quadratic loss functions usually appear in regression problems, and the smoothed hinge loss can be found in linear support vector machine (SVM) problems (Shalev-Shwartz & Zhang, 2013a).
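For reference, the two losses can be written as short vectorized functions. This is a small sketch with names of our choosing; `smoothed_hinge_grad` is included because the dual residue uses the derivative $\phi_i'$, and its slope of magnitude at most $1/\gamma$ reflects the $1/\gamma$-smoothness assumed throughout.

```python
import numpy as np

def quadratic_loss(x, y, gamma=1.0):
    """phi(x) = (x - y)^2 / (2 * gamma)."""
    return (x - y) ** 2 / (2.0 * gamma)

def smoothed_hinge_loss(x, y, gamma=1.0):
    """Zero for y*x >= 1, linear for y*x <= 1 - gamma, quadratic in between."""
    z = y * x
    return np.where(z >= 1.0, 0.0,
                    np.where(z <= 1.0 - gamma,
                             1.0 - z - gamma / 2.0,
                             (1.0 - z) ** 2 / (2.0 * gamma)))

def smoothed_hinge_grad(x, y, gamma=1.0):
    """Derivative of the smoothed hinge loss with respect to x."""
    z = y * x
    return y * np.where(z >= 1.0, 0.0,
                        np.where(z <= 1.0 - gamma, -1.0, (z - 1.0) / gamma))
```

The two branches of the smoothed hinge loss meet at $y_i x = 1 - \gamma$ with common value $\gamma/2$, so the function and its derivative are continuous there.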

6.2. Numerical results 

We used 5 different datasets: w8a, dorothea, mushrooms, cov1 and ijcnn1 (see Table 2).

In all our experiments we used $\gamma = 1$ and $\lambda = 1/n$.
AdaSDCA. The results of the theory developed in Section 4 can be observed in Figures 1 to 4. AdaSDCA needs the fewest iterations to converge, confirming the theoretical result.

AdaSDCA+ vs. others. We can observe in Figures 15 to 24 that both options of AdaSDCA+ outperform SDCA and IProx-SDCA, in terms of number of iterations, for quadratic loss functions and for smoothed hinge loss functions. One can observe similar results in terms of time in Figures 5 to 14.

Option I vs. Option II. Despite the fact that Option I is not theoretically supported for the smoothed hinge loss, it still converges faster than Option II on every dataset and for every loss function. The biggest difference can be observed in Figure 13, where Option I converges to machine precision in just 15 seconds.

Different choices of m. To show the impact of different choices of $m$ on the performance of AdaSDCA+, in Figures 25 to 33 we compare the results of the two options of AdaSDCA+ using $m$ equal to 2, 10 and 50. It is hard to draw a clear conclusion here because the optimal $m$ depends on the dataset and the problem type.




Figure 1. w8a dataset d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 2. dorothea dataset d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 3. mushrooms dataset d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 4. ijcnn1 dataset d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 5. w8a dataset d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 6. dorothea dataset d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 7. mushrooms dataset d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 8. cov1 dataset d = 54, n = 581012, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 9. ijcnn1 dataset d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing real time with known algorithms

Figure 10. w8a dataset d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 11. dorothea dataset d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 12. mushrooms dataset d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 13. cov1 dataset d = 54, n = 581012, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 14. ijcnn1 dataset d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparing real time with known algorithms

Figure 15. w8a dataset d = 300, n = 49749, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 16. dorothea dataset d = 100000, n = 800, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 17. mushrooms dataset d = 112, n = 8124, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 18. cov1 dataset d = 54, n = 581012, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 19. ijcnn1 dataset d = 22, n = 49990, Quadratic loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 20. w8a dataset d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 21. dorothea dataset d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 22. mushrooms dataset d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 23. cov1 dataset d = 54, n = 581012, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 24. ijcnn1 dataset d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparing number of iterations with known algorithms

Figure 25. w8a dataset d = 300, n = 49749, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 26. dorothea dataset d = 100000, n = 800, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 27. mushrooms dataset d = 112, n = 8124, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 28. cov1 dataset d = 54, n = 581012, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 29. ijcnn1 dataset d = 22, n = 49990, Quadratic loss with L2 regularizer, comparison of different choices of the constant m

Figure 30. w8a dataset d = 300, n = 49749, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 31. dorothea dataset d = 100000, n = 800, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 32. mushrooms dataset d = 112, n = 8124, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m

Figure 33. ijcnn1 dataset d = 22, n = 49990, Smooth Hinge loss with L2 regularizer, comparison of different choices of the constant m










References 

Agarwal, Alekh and Bottou, Leon. A lower bound for the 
optimization of finite sums. arXiv:1410.0723, 2014. 

Banks-Watson, Alexander. New classes of coordinate de¬ 
scent methods. Master’s thesis. University of Edinburgh, 
2012. 

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Si¬ 
mon. Saga: A fast incremental gradient method with 
support for non-strongly convex composite objectives. 
arXiv:1407.0202, 2014. 

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive 
subgradient methods for online learning and stochastic 
optimization. Journal of Machine Learning Research , 
12(1):2121—2159,2011. 

Fercoq, Olivier and Richtarik, Peter. Accelerated, parallel 
and proximal coordinate descent. SIAM Journal on Opti¬ 
mization (after minor revision), arXiv:1312.5799, 2013. 

Fercoq, Olivier and Richtarik, Peter. Smooth minimization 
of nonsmooth functions by parallel coordinate descent. 
arXiv:1309.5885, 2013. 

Fercoq, Olivier, Qu, Zheng, Richtarik, Peter, and Takac, 
Martin. Fast distributed coordinate descent for mini¬ 
mizing non-strongly convex losses. IEEE International 
Workshop on Machine Learning for Signal Processing, 
2014. 

Fountoulakis, Kimon and Tappenden, Rachael. Robust 
block coordinate descent. arXiv: 1407.7573, 2014. 

Glasmachers, Tobias and Dogan, Urun. Accelerated co¬ 
ordinate descent with adaptive coordinate frequencies. 
In Asian Conference on Machine Learning, pp. 72-86, 

2013. 

Jaggi, Martin, Smith, Virginia, Takac, Martin, Terhorst, 
Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jor¬ 
dan, Michael I. Communication-efficient distributed 
dual coordinate ascent. In Ghahramani, Z., Welling, 
M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. 
(eds.). Advances in Neural Information Processing Sys¬ 
tems 27, pp. 3068-3076. Curran Associates, Inc., 2014. 

Johnson, Rie and Zhang, Tong. Accelerating stochastic 
gradient descent using predictive variance reduction. In 
NIPS, 2013. 

Konecny, Jakub and Richtarik, Peter. S2GD: Semi-stochastic
gradient descent methods. arXiv:1312.1666, 2014.

Konecny, Jakub, Lu, Jie, Richtarik, Peter, and Takac, Martin.
mS2GD: Mini-batch semi-stochastic gradient descent
in the proximal setting. arXiv:1410.4744, 2014a.

Konecny, Jakub, Qu, Zheng, and Richtarik, Peter.
Semi-stochastic coordinate descent. arXiv:1412.6293, 2014b.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated 
proximal coordinate gradient method and its application 
to regularized empirical risk minimization. Technical 
Report MSR-TR-2014-94, July 2014. 

Liu, Ji and Wright, Stephen J. Asynchronous stochastic
coordinate descent: Parallelism and convergence properties.
arXiv:1403.3862, 2014.

Loshchilov, I., Schoenauer, M., and Sebag, M. Adaptive
Coordinate Descent. In Krasnogor, N. et al. (eds.),
Genetic and Evolutionary Computation Conference
(GECCO), pp. 885-892. ACM Press, July 2011.

Lukasewitz, Isabella. Block-coordinate Frank-Wolfe
optimization. A study on randomized sampling methods,
2013.

Mahajan, Dhruv, Keerthi, S. Sathiya, and Sundararajan, S.
A distributed block coordinate descent method for training
ℓ1 regularized linear classifiers. arXiv:1405.4544, 2014.

Mairal, Julien. Incremental majorization-minimization
optimization with application to large-scale machine
learning. Technical report, 2014.

Marecek, Jakub, Richtarik, Peter, and Takac, Martin.
Distributed block coordinate descent for minimizing
partially separable functions. arXiv:1406.0328, 2014.

Necoara, Ion and Patrascu, Andrei. A random coordinate
descent algorithm for optimization problems with composite
objective function and linear coupled constraints.
Computational Optimization and Applications, 57:307-337,
2014.

Necoara, Ion, Nesterov, Yurii, and Glineur, Francois.
Efficiency of randomized coordinate descent methods on
optimization problems with linearly coupled constraints.
Technical report, Politehnica University of Bucharest,
2012.

Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and
Shapiro, Alexander. Robust stochastic approximation
approach to stochastic programming. SIAM Journal on
Optimization, 19(4):1574-1609, 2009.

Nesterov, Yurii. Efficiency of coordinate descent methods
on huge-scale optimization problems. SIAM Journal on
Optimization, 22(2):341-362, 2012.

Qu, Zheng and Richtarik, Peter. Coordinate descent methods
with arbitrary sampling I: Algorithms and complexity.
arXiv:1412.8060, 2014.






Qu, Zheng and Richtarik, Peter. Coordinate Descent with
Arbitrary Sampling II: Expected Separable
Overapproximation. ArXiv e-prints, 2014.

Qu, Zheng, Richtarik, Peter, and Zhang, Tong. Randomized
Dual Coordinate Ascent with Arbitrary Sampling.
arXiv:1411.5873, 2014.

Richtarik, Peter and Takac, Martin. Distributed coordinate
descent method for learning with big data.
arXiv:1310.2059, 2013a.

Richtarik, Peter and Takac, Martin. On optimal
probabilities in stochastic coordinate descent methods.
arXiv:1310.3438, 2013b.

Richtarik, Peter and Takac, Martin. Parallel coordinate
descent methods for big data optimization problems.
Mathematical Programming (after minor revision),
arXiv:1212.0873, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of
randomized block-coordinate descent methods for
minimizing a composite function. Mathematical
Programming, 144(2):1-38, 2014.

Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more
pesky learning rates. Journal of Machine Learning
Research, 3(28):343-351, 2013.

Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis.
Minimizing finite sums with the stochastic average gradient.
arXiv:1309.2388, 2013.

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods
for ℓ1-regularized loss minimization. Journal of Machine
Learning Research, 12:1865-1892, 2011.

Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic
dual coordinate ascent. arXiv:1211.2717, 2012.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated
mini-batch stochastic dual coordinate ascent. In Advances
in Neural Information Processing Systems 26, pp. 378-385,
2013a.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual
coordinate ascent methods for regularized loss. Journal
of Machine Learning Research, 14(1):567-599, 2013b.

Takac, Martin, Bijral, Avleen Singh, Richtarik, Peter, and
Srebro, Nathan. Mini-batch primal and dual methods for
SVMs. CoRR, abs/1303.2314, 2013.

Tappenden, Rachael, Richtarik, Peter, and Gondzio, Jacek. 
Inexact block coordinate descent method: complexity 
and preconditioning. arXiv:1304.5530, 2013. 


Tappenden, Rachael, Richtarik, Peter, and Büke, Burak.
Separable approximations and decomposition methods
for the augmented Lagrangian. Optimization Methods
and Software, 2014.

Trofimov, Ilya and Genkin, Alexander. Distributed
coordinate descent for ℓ1-regularized logistic regression.
arXiv:1411.6520, 2014.

Wright, Stephen J. Coordinate descent algorithms.
Technical report, 2014. URL
http://www.optimization-online.org/DB_FILE/2014/1

Xiao, Lin and Zhang, Tong. A proximal stochastic
gradient method with progressive variance reduction.
arXiv:1403.4699, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual
coordinate method for regularized empirical risk
minimization. Technical Report MSR-TR-2014-123, September
2014.

Zhao, Peilin and Zhang, Tong. Stochastic optimization
with importance sampling. arXiv:1401.2753, 2014.

Zhao, Tuo, Liu, Han, and Zhang, Tong. A general theory
of pathwise coordinate optimization. arXiv:1412.7477,
2014a.

Zhao, Tuo, Yu, Mo, Wang, Yiming, Arora, Raman, and
Liu, Han. Accelerated mini-batch randomized block
coordinate descent method. In Ghahramani, Z., Welling,
M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q.
(eds.), Advances in Neural Information Processing Systems
27, pp. 3329-3337. Curran Associates, Inc., 2014b.





Appendix 


Proofs 

We shall need the following inequality. 

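Lemma 13. The function $f : \mathbb{R}^n \to \mathbb{R}$ defined in (3) satisfies

$$f(\alpha + h) \le f(\alpha) + \langle \nabla f(\alpha), h \rangle + \frac{1}{2\lambda n^2} h^\top A^\top A h \tag{18}$$

for all $\alpha, h \in \mathbb{R}^n$.

Proof. Since $g$ is 1-strongly convex, $g^*$ is 1-smooth. Pick $\alpha, h \in \mathbb{R}^n$. Since $f(\alpha) = \lambda g^*\left(\frac{1}{\lambda n} A\alpha\right)$, we have

$$\begin{aligned}
f(\alpha + h) &= \lambda g^*\left(\tfrac{1}{\lambda n} A\alpha + \tfrac{1}{\lambda n} A h\right) \\
&\le \lambda \left( g^*\left(\tfrac{1}{\lambda n} A\alpha\right) + \left\langle \nabla g^*\left(\tfrac{1}{\lambda n} A\alpha\right), \tfrac{1}{\lambda n} A h \right\rangle + \tfrac{1}{2} \left\| \tfrac{1}{\lambda n} A h \right\|^2 \right) \\
&= f(\alpha) + \langle \nabla f(\alpha), h \rangle + \frac{1}{2\lambda n^2} h^\top A^\top A h.
\end{aligned}$$

□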
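Inequality (18) can be sanity-checked numerically. The sketch below is only an illustration under assumptions that are not taken from the paper: it picks the 1-strongly convex regularizer $g(w) = \frac{1}{2}\|w\|^2 + \mu\|w\|_1$, whose conjugate has a closed form, and all helper names (`soft_threshold`, `g_star`, `apply_A`, `lemma13_slack`) and the random data are hypothetical.

```python
import random

def soft_threshold(u, mu):
    """Gradient of g* for the assumed g(w) = 0.5*||w||^2 + mu*||w||_1."""
    return [(abs(x) - mu) * (1.0 if x > 0 else -1.0) if abs(x) > mu else 0.0
            for x in u]

def g_star(u, mu):
    """Conjugate of the assumed g: sum_k 0.5 * max(|u_k| - mu, 0)^2 (1-smooth)."""
    return 0.5 * sum(max(abs(x) - mu, 0.0) ** 2 for x in u)

def apply_A(samples, a):
    """Compute A a = sum_i a_i A_i, where samples[i] is the column A_i in R^d."""
    d = len(samples[0])
    return [sum(a[i] * samples[i][k] for i in range(len(samples)))
            for k in range(d)]

def f(samples, a, lam, mu):
    """f(alpha) = lam * g*(A alpha / (lam * n)), following (3)."""
    n = len(samples)
    return lam * g_star([x / (lam * n) for x in apply_A(samples, a)], mu)

def grad_f(samples, a, lam, mu):
    """grad_i f(alpha) = (1/n) A_i^T w with w = grad g*(A alpha / (lam * n))."""
    n = len(samples)
    w = soft_threshold([x / (lam * n) for x in apply_A(samples, a)], mu)
    return [sum(s_k * w_k for s_k, w_k in zip(samples[i], w)) / n
            for i in range(n)]

def lemma13_slack(samples, a, h, lam, mu):
    """Right-hand side of (18) minus its left-hand side; should be >= 0."""
    n = len(samples)
    Ah = apply_A(samples, h)
    rhs = (f(samples, a, lam, mu)
           + sum(gi * hi for gi, hi in zip(grad_f(samples, a, lam, mu), h))
           + sum(x * x for x in Ah) / (2.0 * lam * n * n))
    a_plus_h = [ai + hi for ai, hi in zip(a, h)]
    return rhs - f(samples, a_plus_h, lam, mu)

# Hypothetical random instance: n samples in R^d.
random.seed(0)
n, d, lam, mu = 6, 4, 0.1, 0.3
samples = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
slacks = []
for _ in range(200):
    a = [random.gauss(0, 1) for _ in range(n)]
    h = [random.gauss(0, 1) for _ in range(n)]
    slacks.append(lemma13_slack(samples, a, h, lam, mu))
print("smallest slack over 200 trials:", min(slacks))
```

With $\mu = 0$ the function $f$ becomes quadratic and (18) holds with equality, so the slack is zero up to rounding; a positive $\mu$ makes the inequality strict on some draws.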


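Proof of Lemma 3. It can be easily checked that the following relations hold:

$$\nabla_i f(\alpha^t) = \frac{1}{n} A_i^\top w^t, \quad \forall t \ge 0,\ i \in [n], \tag{19}$$

$$g(w^t) + g^*(\bar{\alpha}^t) = \langle w^t, \bar{\alpha}^t \rangle, \quad \forall t \ge 0, \tag{20}$$

where $\{w^t, \alpha^t, \bar{\alpha}^t\}_{t \ge 0}$ is the output sequence of Algorithm 1. Let $t \ge 0$ and $\theta \in [0, \min_{i \in I_t} p_i^t]$. For each $i \in [n]$, since $\phi_i$ is $1/\gamma$-smooth, $\phi_i^*$ is $\gamma$-strongly convex and thus for arbitrary $s_i \in [0,1]$,

$$\begin{aligned}
\phi_i^*(-\alpha_i^t + s_i \kappa_i^t)
&= \phi_i^*\left((1 - s_i)(-\alpha_i^t) + s_i \nabla\phi_i(A_i^\top w^t)\right) \\
&\le (1 - s_i)\,\phi_i^*(-\alpha_i^t) + s_i\,\phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) - \frac{\gamma s_i (1 - s_i) |\kappa_i^t|^2}{2}.
\end{aligned} \tag{21}$$

We have:

$$\begin{aligned}
f(\alpha^{t+1}) - f(\alpha^t)
&\overset{(18)}{\le} \langle \nabla f(\alpha^t), \alpha^{t+1} - \alpha^t \rangle + \frac{1}{2\lambda n^2} \left\langle \alpha^{t+1} - \alpha^t, A^\top A (\alpha^{t+1} - \alpha^t) \right\rangle \\
&= \nabla_{i_t} f(\alpha^t)\, \Delta\alpha_{i_t}^t + \frac{v_{i_t}}{2\lambda n^2} |\Delta\alpha_{i_t}^t|^2 \\
&= \frac{1}{n} A_{i_t}^\top w^t\, \Delta\alpha_{i_t}^t + \frac{v_{i_t}}{2\lambda n^2} |\Delta\alpha_{i_t}^t|^2.
\end{aligned} \tag{22}$$

Thus,

$$\begin{aligned}
D(\alpha^{t+1}) - D(\alpha^t)
&\ge -\frac{1}{n} A_{i_t}^\top w^t\, \Delta\alpha_{i_t}^t - \frac{v_{i_t}}{2\lambda n^2} |\Delta\alpha_{i_t}^t|^2 + \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i^t) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i^{t+1}) \\
&= -\frac{1}{n} A_{i_t}^\top w^t\, \Delta\alpha_{i_t}^t - \frac{v_{i_t}}{2\lambda n^2} |\Delta\alpha_{i_t}^t|^2 + \frac{1}{n} \phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n} \phi_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t}^t)\right) \\
&= \max_{\Delta \in \mathbb{R}} \left\{ -\frac{1}{n} A_{i_t}^\top w^t\, \Delta - \frac{v_{i_t}}{2\lambda n^2} |\Delta|^2 + \frac{1}{n} \phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n} \phi_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta)\right) \right\},
\end{aligned}$$

where the last equality follows from the definition of $\Delta\alpha_{i_t}^t$ in Algorithm 1. Then by letting $\Delta = -s_{i_t} \kappa_{i_t}^t$ for some arbitrary $s_{i_t} \in [0,1]$ we get:

$$\begin{aligned}
D(\alpha^{t+1}) - D(\alpha^t)
&\ge \frac{s_{i_t} A_{i_t}^\top w^t \kappa_{i_t}^t}{n} - \frac{s_{i_t}^2 v_{i_t} |\kappa_{i_t}^t|^2}{2\lambda n^2} + \frac{1}{n} \phi_{i_t}^*(-\alpha_{i_t}^t) - \frac{1}{n} \phi_{i_t}^*(-\alpha_{i_t}^t + s_{i_t} \kappa_{i_t}^t) \\
&\ge \frac{s_{i_t}}{n} \left[ \phi_{i_t}^*(-\alpha_{i_t}^t) - \phi_{i_t}^*\left(\nabla\phi_{i_t}(A_{i_t}^\top w^t)\right) + A_{i_t}^\top w^t \kappa_{i_t}^t \right] - \frac{s_{i_t}^2 v_{i_t} |\kappa_{i_t}^t|^2}{2\lambda n^2} + \frac{\gamma s_{i_t} (1 - s_{i_t}) |\kappa_{i_t}^t|^2}{2n}.
\end{aligned}$$

By taking expectation with respect to $i_t$ we get:

$$\begin{aligned}
\mathbf{E}_t\left[D(\alpha^{t+1}) - D(\alpha^t)\right]
&\ge \sum_{i=1}^n \frac{p_i^t s_i}{n} \left[ \phi_i^*(-\alpha_i^t) - \phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) + A_i^\top w^t \kappa_i^t \right] \\
&\quad - \sum_{i=1}^n \frac{p_i^t s_i^2 |\kappa_i^t|^2 (v_i + n\lambda\gamma)}{2\lambda n^2} + \sum_{i=1}^n \frac{p_i^t \gamma s_i |\kappa_i^t|^2}{2n}.
\end{aligned} \tag{23}$$

Set

$$s_i = \begin{cases} 0, & i \notin I_t, \\ \theta / p_i^t, & i \in I_t. \end{cases} \tag{24}$$

Then $s_i \in [0,1]$ for each $i \in [n]$ and by plugging it into (23) we get:

$$\begin{aligned}
\mathbf{E}_t\left[D(\alpha^{t+1}) - D(\alpha^t)\right]
&\ge \frac{\theta}{n} \sum_{i \in I_t} \left[ \phi_i^*(-\alpha_i^t) - \phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) + A_i^\top w^t \kappa_i^t \right] \\
&\quad - \frac{\theta}{2\lambda n^2} \sum_{i \in I_t} \left( \frac{\theta (v_i + n\lambda\gamma)}{p_i^t} - n\lambda\gamma \right) |\kappa_i^t|^2.
\end{aligned}$$

Finally note that:

$$\begin{aligned}
P(w^t) - D(\alpha^t)
&= \frac{1}{n} \sum_{i=1}^n \left[ \phi_i(A_i^\top w^t) + \phi_i^*(-\alpha_i^t) \right] + \lambda \left( g(w^t) + g^*(\bar{\alpha}^t) \right) \\
&\overset{(20)}{=} \frac{1}{n} \sum_{i=1}^n \left[ \phi_i^*(-\alpha_i^t) + \phi_i(A_i^\top w^t) \right] + \frac{1}{n} \langle w^t, A\alpha^t \rangle \\
&= \frac{1}{n} \sum_{i=1}^n \left[ \phi_i^*(-\alpha_i^t) + A_i^\top w^t\, \nabla\phi_i(A_i^\top w^t) - \phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) + A_i^\top w^t \alpha_i^t \right] \\
&= \frac{1}{n} \sum_{i=1}^n \left[ \phi_i^*(-\alpha_i^t) - \phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) + A_i^\top w^t \kappa_i^t \right] \\
&= \frac{1}{n} \sum_{i \in I_t} \left[ \phi_i^*(-\alpha_i^t) - \phi_i^*\left(\nabla\phi_i(A_i^\top w^t)\right) + A_i^\top w^t \kappa_i^t \right].
\end{aligned}$$

□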
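The bound of Lemma 3 can be checked exactly on a small instance, because the expectation over $i_t$ is a finite sum. The following is a minimal sketch, not the authors' code, and it rests on labeled assumptions: the quadratic loss $\phi_i(x) = (x - y_i)^2 / (2\gamma)$, the regularizer $g(w) = \frac{1}{2}\|w\|^2$, $v_i = \|A_i\|^2$, and the closed-form step that maximizes the one-dimensional expression in the proof for this loss. All function names and data are hypothetical.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def primal_point(samples, alpha, lam):
    """w = grad g*(alpha_bar) = A alpha / (lam * n) for the assumed g = 0.5*||w||^2."""
    n, d = len(samples), len(samples[0])
    return [sum(alpha[i] * samples[i][k] for i in range(n)) / (lam * n)
            for k in range(d)]

def primal(samples, y, w, lam, gamma):
    """P(w) with the assumed quadratic loss (x - y_i)^2 / (2 * gamma)."""
    n = len(samples)
    loss = sum((dot(a, w) - yi) ** 2 / (2.0 * gamma)
               for a, yi in zip(samples, y)) / n
    return loss + lam * 0.5 * dot(w, w)

def dual(samples, y, alpha, lam, gamma):
    """D(alpha) = -f(alpha) - (1/n) sum_i phi_i*(-alpha_i)."""
    n = len(samples)
    w = primal_point(samples, alpha, lam)
    f_val = lam * 0.5 * dot(w, w)  # lam * g*(A alpha / (lam * n))
    # For the quadratic loss, phi_i*(-a) = gamma * a^2 / 2 - a * y_i.
    psi = sum(gamma * ai * ai / 2.0 - ai * yi for ai, yi in zip(alpha, y)) / n
    return -f_val - psi

def lemma3_sides(samples, y, alpha, p, theta, lam, gamma):
    """Return (expected dual increase, lower bound from Lemma 3)."""
    n = len(samples)
    w = primal_point(samples, alpha, lam)
    v = [dot(a, a) for a in samples]  # assumed v_i = ||A_i||^2
    # Dual residue: kappa_i = alpha_i + grad phi_i(A_i^T w).
    kappa = [alpha[i] + (dot(samples[i], w) - y[i]) / gamma for i in range(n)]
    d_now = dual(samples, y, alpha, lam, gamma)
    lhs = 0.0
    for i in range(n):
        # Closed-form maximizer of the one-dimensional subproblem (quadratic loss).
        step = -kappa[i] * lam * n * gamma / (v[i] + lam * n * gamma)
        new_alpha = list(alpha)
        new_alpha[i] += step
        lhs += p[i] * (dual(samples, y, new_alpha, lam, gamma) - d_now)
    gap = primal(samples, y, w, lam, gamma) - d_now
    penalty = sum((theta * (v[i] + n * lam * gamma) / p[i] - n * lam * gamma)
                  * kappa[i] ** 2 for i in range(n) if kappa[i] != 0.0)
    rhs = theta * gap - theta / (2.0 * lam * n * n) * penalty
    return lhs, rhs

# Hypothetical random instance with a non-uniform sampling distribution p.
random.seed(1)
n, d, lam, gamma = 5, 3, 0.1, 1.0
samples = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
alpha = [random.gauss(0, 1) for _ in range(n)]
raw = [random.uniform(0.1, 1.0) for _ in range(n)]
p = [r / sum(raw) for r in raw]
lhs, rhs = lemma3_sides(samples, y, alpha, p, min(p), lam, gamma)
print("expected increase:", lhs, "lower bound:", rhs)
```

Here $\theta$ is set to $\min_i p_i^t$, the largest value the lemma allows when every residue is nonzero; smaller values of $\theta$ give weaker but still valid bounds.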

Proof of Lemma 4. Note that (13) is a standard constrained
maximization problem, where everything independent of $p$
can be treated as a constant. We define the Lagrangian

$$L(p, \eta) = \theta(\kappa^t, p) - \eta \left( \sum_{i=1}^n p_i - 1 \right)$$

and get the following optimality conditions:

$$\frac{|\kappa_i^t|^2 (v_i + n\lambda\gamma)}{p_i^2} = \frac{|\kappa_j^t|^2 (v_j + n\lambda\gamma)}{p_j^2}, \quad \forall i, j \in [n],$$

$$\sum_{i=1}^n p_i = 1,$$

$$p_i \ge 0, \quad \forall i \in [n],$$

the solution of which is (14). □
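The optimality conditions above say that $|\kappa_i^t| \sqrt{v_i + n\lambda\gamma} / p_i$ is the same for every coordinate, which suggests probabilities proportional to $|\kappa_i^t| \sqrt{v_i + n\lambda\gamma}$. The sketch below illustrates this numerically. It assumes that $\theta(\kappa, p)$ is the value of $\theta$ that makes the second term in the bound of Lemma 3 vanish; the function names and the sample numbers are hypothetical.

```python
import math
import random

def theta(kappa, p, v, n, lam, gamma):
    """Assumed form of theta(kappa, p): zeroes the second term of the Lemma 3 bound."""
    idx = [i for i in range(len(kappa)) if kappa[i] != 0.0]
    num = n * lam * gamma * sum(kappa[i] ** 2 for i in idx)
    den = sum(kappa[i] ** 2 * (v[i] + n * lam * gamma) / p[i] for i in idx)
    return num / den

def optimal_probabilities(kappa, v, n, lam, gamma):
    """p_i proportional to |kappa_i| * sqrt(v_i + n*lam*gamma)."""
    weights = [abs(k) * math.sqrt(vi + n * lam * gamma)
               for k, vi in zip(kappa, v)]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical residues and norms; the third residue is zero on purpose,
# so that coordinate receives probability zero.
n, lam, gamma = 4, 0.1, 1.0
kappa = [0.8, -1.5, 0.0, 0.3]
v = [1.0, 2.5, 0.7, 4.0]
p_star = optimal_probabilities(kappa, v, n, lam, gamma)
best = theta(kappa, p_star, v, n, lam, gamma)

# Compare against random strictly positive distributions on the simplex.
random.seed(2)
others = []
for _ in range(100):
    raw = [random.uniform(0.05, 1.0) for _ in range(n)]
    q = [r / sum(raw) for r in raw]
    others.append(theta(kappa, q, v, n, lam, gamma))
print("theta at p*:", best, "best random theta:", max(others))
```

That no other distribution does better follows from the Cauchy-Schwarz inequality applied to the denominator of `theta`, which is why the comparison against random distributions never favors them.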

Proof of Lemma 5. Note that in the proof of Lemma 3, the
condition $\theta \in [0, \min_{i \in I_t} p_i^t]$ is only needed to ensure that
$s_i$ defined by (24) is in $[0,1]$ so that (21) holds. If $\phi_i$ is
a quadratic function, then (21) holds for arbitrary $s_i \in \mathbb{R}$.
Therefore in this case we only need $\theta$ to be positive and the
same reasoning holds. □
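The claim that (21) holds for arbitrary $s_i \in \mathbb{R}$ can be seen concretely: for a quadratic loss the conjugate is itself quadratic, and (21) becomes an identity. The sketch below checks this under the assumption $\phi_i(x) = (x - y_i)^2 / (2\gamma)$, so that $\phi_i^*(u) = \gamma u^2 / 2 + u y_i$; the function names and numbers are hypothetical.

```python
def phi_star(u, y, gamma):
    """Conjugate of the assumed quadratic loss phi(x) = (x - y)^2 / (2 * gamma)."""
    return gamma * u * u / 2.0 + u * y

def residual_21(a, b, s, y, gamma):
    """Right-hand side of (21) minus its left-hand side.

    Here a stands for -alpha_i^t and b for grad phi_i(A_i^T w^t),
    so the dual residue is kappa_i^t = b - a.
    """
    lhs = phi_star((1.0 - s) * a + s * b, y, gamma)
    rhs = ((1.0 - s) * phi_star(a, y, gamma)
           + s * phi_star(b, y, gamma)
           - gamma * s * (1.0 - s) * (a - b) ** 2 / 2.0)
    return rhs - lhs

# Values of s both inside and outside [0, 1].
gaps = [residual_21(a=-0.7, b=1.9, s=s, y=0.4, gamma=2.0)
        for s in (-2.5, 0.0, 0.3, 1.0, 3.0, 10.0)]
print(gaps)
```

The residual vanishes up to rounding for every $s$, including values outside $[0,1]$, which is the property the proof of Lemma 5 relies on.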





